Regression

Core Concept

Regression is a supervised learning task where the goal is to predict a continuous numerical value (or vector of values) from input data. Given labeled training examples—pairs of features and continuous targets—a regression model learns a function that maps inputs to outputs on a numerical scale, enabling it to estimate the value of the target variable for new, unlabeled instances. Unlike classification, which predicts discrete categories, regression produces quantitative outputs: real-valued predictions that can be interpreted as magnitudes, amounts, or measurements. The learned function may be linear (a hyperplane in the feature space) or non-linear (curves, surfaces, or more complex mappings), depending on the model family and the assumed relationship between inputs and target.
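
As a minimal illustration of this mapping, the sketch below fits a linear model to a tiny made-up dataset with scikit-learn and produces a point prediction for a new instance. The feature names, values, and library choice are assumptions for illustration, not part of any specific application.

```python
# Minimal sketch of regression as function approximation.
# All data values here are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [size_sqm, num_rooms]; target: price (a continuous value).
X = np.array([[50, 2], [80, 3], [120, 4], [200, 5]], dtype=float)
y = np.array([150.0, 230.0, 330.0, 520.0])

model = LinearRegression().fit(X, y)      # learn a linear mapping f: X -> y
prediction = model.predict([[100, 3]])    # estimate the target for a new instance
print(prediction)                         # a single continuous value, not a class label
```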

The Learning Process

The regression process involves training a model on examples where both features (input variables) and target values (continuous outputs) are known. The model learns parameters that minimize a loss function measuring prediction error—typically squared error (MSE) for Gaussian assumptions or absolute error (MAE) for robustness to outliers. Learning algorithms adjust weights via closed-form solutions (e.g. normal equation for linear regression) or iterative optimization (e.g. gradient descent), possibly with regularization to control complexity. During inference, the trained model receives new inputs and outputs a single point prediction (and optionally prediction intervals or uncertainty estimates) for the target variable.
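
As a sketch of the two fitting routes mentioned above, the snippet below solves a small synthetic linear-regression problem both in closed form (the normal equation) and by plain gradient descent on mean squared error. The data, learning rate, and iteration count are illustrative assumptions.

```python
# Two ways to fit the same linear model on synthetic data:
# a closed-form normal-equation solve and gradient descent on MSE.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

Xb = np.hstack([np.ones((len(X), 1)), X])      # prepend an intercept column

# Closed form: w = (X^T X)^{-1} X^T y (solved without an explicit inverse)
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Iterative: w <- w - lr * gradient of the mean squared error
w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(2000):
    grad = 2.0 / len(y) * Xb.T @ (Xb @ w - y)
    w -= lr * grad

print(w_closed)   # both should be close to [4.0, 2.0, -1.0, 0.5]
print(w)
```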

Model Forms

How regression models represent the relationship between features and target

  • Linear models: Assume the target is a linear combination of features (plus intercept); interpretable, fast, and well-suited when the true relationship is approximately linear or when simplicity and stability are preferred over flexibility
  • Polynomial and basis expansions: Extend linear models by including powers or other basis functions of features, capturing curvature and non-linear trends within a parametric, interpretable framework; degree controls flexibility and overfitting risk
  • Non-linear and non-parametric models: Tree-based methods, kernel regression, and neural networks can learn arbitrarily complex mappings; they offer greater flexibility but require more data and care to avoid overfitting and to interpret
  • Regularization: Ridge (L2), Lasso (L1), and Elastic Net penalize coefficient magnitude or sparsity, shrinking or selecting features to improve generalization when features are many or correlated (a sketch combining basis expansion and regularization follows this list)
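
To make the basis-expansion and regularization bullets concrete, here is a short sketch, assuming scikit-learn, that feeds a degree-5 polynomial feature map into a ridge (L2-penalized) linear model on synthetic curved data. The degree and penalty strength are illustrative choices, not recommendations.

```python
# Basis expansion plus regularization: a polynomial feature map
# feeding an L2-penalized linear model. Data is synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, size=80)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=80)   # curved relationship

model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),  # expand features to powers of x
    Ridge(alpha=1.0),                                  # shrink coefficients toward zero
)
model.fit(x, y)
print(model.predict([[0.5]]))   # non-linear fit obtained from a linear model on expanded features
```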

Prediction Types

Different forms of output that regression models can produce

  • Point prediction: Single numerical value per instance; the standard output for most regression models, interpretable and directly usable for downstream decisions
  • Prediction intervals: Range within which the true value is expected to fall with a given probability; requires uncertainty quantification (e.g. from residual distribution, Bayesian posteriors, or conformal prediction) and is valuable when knowing the range of plausible outcomes matters (a rough interval sketch follows this list)
  • Probabilistic forecasts: Full predictive distribution over the target (e.g. Gaussian with mean and variance); supports decision-making under uncertainty and custom loss functions
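
The prediction-interval bullet can be sketched with a rough split-conformal recipe, assuming scikit-learn and synthetic data: hold out a calibration set, take a quantile of its absolute residuals, and widen each point prediction by that amount. The coverage level and split sizes are illustrative, and the quantile is not finite-sample adjusted.

```python
# Rough split-conformal prediction interval around a point estimate.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Calibration residuals give the interval half-width for roughly 90% coverage.
residuals = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(residuals, 0.9)

x_new = np.array([[0.2, -1.0]])
point = model.predict(x_new)[0]
print(point - q, point + q)    # prediction interval around the point prediction
```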

Evaluation Metrics

Methods for measuring and comparing regression model performance

  • Mean Squared Error (MSE): Average of squared differences between predicted and actual values; differentiable and penalizes large errors heavily; expressed in the squared units of the target
  • Root Mean Squared Error (RMSE): Square root of MSE; in the same units as the target, so directly interpretable as the typical magnitude of error
  • Mean Absolute Error (MAE): Average of absolute differences; robust to outliers and interpretable as average absolute deviation
  • R² (coefficient of determination): Proportion of variance in the target explained by the model; 0 means no better than predicting the mean, 1 means a perfect fit; comparable across models on the same dataset
  • Adjusted R²: R² penalized for the number of predictors; discourages adding irrelevant features and supports model comparison when complexity differs
  • Cross-validation: Hold-out or K-fold evaluation of MSE/RMSE/MAE/R² on unseen folds to estimate generalization and tune complexity (e.g. polynomial degree, regularization strength); a short computation sketch follows this list
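
The sketch below, assuming scikit-learn and synthetic data, computes the metrics listed above on a held-out split and adds a 5-fold cross-validated R² as a generalization estimate.

```python
# Computing standard regression metrics on a held-out split,
# plus a K-fold cross-validation estimate. Data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

mse = mean_squared_error(y_te, pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))                       # same units as the target
print("MAE :", mean_absolute_error(y_te, pred))
print("R²  :", r2_score(y_te, pred))

# 5-fold cross-validated R² as an estimate of generalization.
print("CV R²:", cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())
```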

Common Challenges

Frequent pitfalls in fitting and interpreting regression models

  • Overfitting: The model fits training noise rather than the underlying relationship, especially with many features or flexible forms (high-degree polynomials, deep trees); mitigation includes regularization, cross-validation, and limiting model complexity
  • Outliers: Can distort least-squares estimates and metrics like MSE; robust loss functions (e.g. MAE, Huber) or explicit outlier handling may be needed (a short robustness sketch follows this list)
  • Heteroscedasticity: Non-constant variance of errors violates the assumptions behind standard inference and interval construction; weighted least squares or a transformation of the target can help
  • Multicollinearity: Correlated features inflate the variance of coefficient estimates and complicate interpretation; regularization or feature selection can stabilize estimates
  • Non-linearity: When the true relationship is curved or involves interactions, purely linear fits incur systematic bias; polynomial terms, splines, or non-linear models may be required
  • Extrapolation: Predictions beyond the range of the training data are unreliable for most models; they should be restricted to the supported input range or accompanied by appropriate uncertainty
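
As a sketch of the outlier point, the snippet below, assuming scikit-learn, fits ordinary least squares and a Huber-loss regressor to the same synthetic data with a few injected outliers and compares the estimated slopes.

```python
# Squared-error vs Huber loss on data containing a few large outliers.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)
y[-5:] += 50.0                     # inject a handful of large outliers at high x

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=200).fit(X, y)

print("OLS slope  :", ols.coef_[0])    # pulled upward by the outliers
print("Huber slope:", huber.coef_[0])  # closer to the true slope of 2
```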


Sub-types

Regression sub-types are distinguished by the form of the learned function (linear vs polynomial vs other), the number of targets (univariate vs multivariate), and the presence of constraints or structure (e.g. regularization, time series). This overview covers two foundational parametric forms and their two most common regularized variants.

  • Linear Regression: Models the target as a linear combination of features; the baseline regression approach, interpretable and well-understood, suitable when the relationship is approximately linear.
  • Polynomial Regression: Extends linear regression with polynomial terms (powers of features) to capture curvature and non-linear trends while remaining in a parametric, interpretable family.
  • Ridge Regression: Adds an L2 penalty on coefficient magnitude to linear regression, shrinking estimates toward zero to improve stability and generalization when features are many or correlated.
  • Lasso Regression: Adds an L1 penalty that shrinks coefficients and can drive some exactly to zero, performing feature selection alongside regularization (a short Ridge-vs-Lasso sketch follows this list).
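
Finally, a small sketch of the Ridge-vs-Lasso distinction, assuming scikit-learn and synthetic data where only two of six features matter: the L2 penalty shrinks all coefficients smoothly, while the L1 penalty can set irrelevant ones exactly to zero. The penalty strengths are illustrative, not tuned.

```python
# Coefficient shrinkage (Ridge, L2) vs sparsity (Lasso, L1) on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))  # irrelevant coefficients small but non-zero
print("Lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients typically exactly 0
```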